-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iceberg
table format support for filesystem
destination
#2067
Conversation
- mypy upgrade needed to solve this issue: apache/iceberg-python#768 - uses <1.13.0 requirement on mypy because 1.13.0 gives error - new lint errors arising due to version upgrade are simply ignored
…-iceberg-filesystem
✅ Deploy Preview for dlt-hub-docs canceled.
|
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
@rudolfix / @sh-rp this PR isn't there yet but can you give an intermediate review? What still needs to happen:
Notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is almost good.
- we need to decide if we want to save per table catalog.
- I'd enable iceberg scanner in duckdb/sql_client in filesystem. mostly to start running the same tests that use delta
- maybe a refactor of
get_catalog
I mentioned.
dlt/common/libs/pyiceberg.py
Outdated
"""Returns single-table, ephemeral, in-memory Iceberg catalog.""" | ||
|
||
# create in-memory catalog | ||
catalog = SqlCatalog( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: how we get a catalog should be some kind of plugin. so we can easily plug glue or rest to filesystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code is good and ready to merge. we also have sufficient tests. we still need a little bit better docs.
pyiceberg.io:__init__.py:348 Defaulting to PyArrow FileIO
maybe we could hide this log? we set log levels during tests inpytest_configure
there are some tests that do not work in https://github.com/dlt-hub/dlt/actions/runs/12096164907/job/33730150792?pr=2067#step:8:4106 pls take a look(they are coming from somewhere else)- we need help with improving the docs. my take would be to create a separate destination Delta / Iceberg where we could move most of the docs out of the filesystem WDYT?
- for both table formats you could write shortly how they are implemented
- in case of iceberg we also should describe how we write tables without catalog and mention the limitations (single writer!)
- we have a page where we enumerate table formats. I'll write some kind of introduction there next week
elif schema_table.get("table_format") == "iceberg": | ||
from dlt.common.libs.pyiceberg import _get_last_metadata_file | ||
|
||
self._setup_iceberg(self._conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't the latest version of duckdb iceberg working with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both issues mentioned in _setup_iceberg
are still open. I just tried with duckdb=1.1.3
, and it fails without _setup_iceberg
.
…-iceberg-filesystem
@rudolfix I've addressed your feedback. Please have a look. Only thing I don't understand are these failing tests:
I can't reproduce them, it works on my machine. It's strange because, looking at the logs, it looks as if def _resolve_config_field(
...
# if value is resolved, then deserialize and coerce it
if value is not None:
# do not deserialize explicit values
if value is not explicit_value:
value = deserialize_value(key, value, inner_hint) I'm probably overlooking something. Do you know what's going on? |
@jorritsandbrink that got fixed on another branch yesterday. so we are good. for some reason docs linting is not passing. pls check it out |
@rudolfix fixed it by deleting Error was:
on Don't think that error came from this branch. |
…-iceberg-filesystem
…-iceberg-filesystem
@jorritsandbrink now it all LGTM! just here: I think we are importing iceberg to aggressively into filesystem. we should allow it to work both without delta and iceberg. please check it out! |
@rudolfix I don't think we're importing |
…-hub/dlt into feat/1996-iceberg-filesystem
* add pyiceberg dependency and upgrade mypy - mypy upgrade needed to solve this issue: apache/iceberg-python#768 - uses <1.13.0 requirement on mypy because 1.13.0 gives error - new lint errors arising due to version upgrade are simply ignored * extend pyiceberg dependencies * remove redundant delta annotation * add basic local filesystem iceberg support * add active table format setting * disable merge tests for iceberg table format * restore non-redundant extra info * refactor to in-memory iceberg catalog * add s3 support for iceberg table format * add schema evolution support for iceberg table format * extract _register_table function * add partition support for iceberg table format * update docstring * enable child table test for iceberg table format * enable empty source test for iceberg table format * make iceberg catalog namespace configurable and default to dataset name * add optional typing * fix typo * improve typing * extract logic into dedicated function * add iceberg read support to filesystem sql client * remove unused import * add todo * extract logic into separate functions * add azure support for iceberg table format * generalize delta table format tests * enable get tables function test for iceberg table format * remove ignores * undo table directory management change * enable test_read_interfaces tests for iceberg * fix active table format filter * use mixin for object store rs credentials * generalize catalog typing * extract pyiceberg scheme mapping into separate function * generalize credentials mixin test setup * remove unused import * add centralized fallback to append when merge is not supported * Revert "add centralized fallback to append when merge is not supported" This reverts commit 54cd0bc. * fall back to append if merge is not supported on filesystem * fix test for s3-compatible storage * remove obsolete code path * exclude gcs read interface tests for iceberg * add gcs support for iceberg table format * switch to UnsupportedAuthenticationMethodException * add iceberg table format docs * use shorter pipeline name to prevent too long sql identifiers * add iceberg catalog note to docs * black format * use shorter pipeline name to prevent too long sql identifiers * correct max id length for sqlalchemy mysql dialect * Revert "use shorter pipeline name to prevent too long sql identifiers" This reverts commit 6cce03b. * Revert "use shorter pipeline name to prevent too long sql identifiers" This reverts commit ef29aa7. * replace show with execute to prevent useless print output * add abfss scheme to test * remove az support for iceberg table format * remove iceberg bucket test exclusion * add note to docs on azure scheme support for iceberg table format * exclude iceberg from duckdb s3-compatibility test * disable pyiceberg info logs for tests * extend table format docs and move into own page * upgrade adlfs to enable account_host attribute * Merge branch 'devel' of https://github.com/dlt-hub/dlt into feat/1996-iceberg-filesystem * fix lint errors * re-add pyiceberg dependency * enabled iceberg in dbt-duckdb * upgrade pyiceberg version * remove pyiceberg mypy errors across python version * does not install airflow group for dev * fixes gcp oauth iceberg credentials handling * fixes ca cert bundle duckdb azure on ci * allow for airflow dep to be present during type check --------- Co-authored-by: Marcin Rudolf <[email protected]>
* add pyiceberg dependency and upgrade mypy - mypy upgrade needed to solve this issue: apache/iceberg-python#768 - uses <1.13.0 requirement on mypy because 1.13.0 gives error - new lint errors arising due to version upgrade are simply ignored * extend pyiceberg dependencies * remove redundant delta annotation * add basic local filesystem iceberg support * add active table format setting * disable merge tests for iceberg table format * restore non-redundant extra info * refactor to in-memory iceberg catalog * add s3 support for iceberg table format * add schema evolution support for iceberg table format * extract _register_table function * add partition support for iceberg table format * update docstring * enable child table test for iceberg table format * enable empty source test for iceberg table format * make iceberg catalog namespace configurable and default to dataset name * add optional typing * fix typo * improve typing * extract logic into dedicated function * add iceberg read support to filesystem sql client * remove unused import * add todo * extract logic into separate functions * add azure support for iceberg table format * generalize delta table format tests * enable get tables function test for iceberg table format * remove ignores * undo table directory management change * enable test_read_interfaces tests for iceberg * fix active table format filter * use mixin for object store rs credentials * generalize catalog typing * extract pyiceberg scheme mapping into separate function * generalize credentials mixin test setup * remove unused import * add centralized fallback to append when merge is not supported * Revert "add centralized fallback to append when merge is not supported" This reverts commit 54cd0bc. * fall back to append if merge is not supported on filesystem * fix test for s3-compatible storage * remove obsolete code path * exclude gcs read interface tests for iceberg * add gcs support for iceberg table format * switch to UnsupportedAuthenticationMethodException * add iceberg table format docs * use shorter pipeline name to prevent too long sql identifiers * add iceberg catalog note to docs * black format * use shorter pipeline name to prevent too long sql identifiers * correct max id length for sqlalchemy mysql dialect * Revert "use shorter pipeline name to prevent too long sql identifiers" This reverts commit 6cce03b. * Revert "use shorter pipeline name to prevent too long sql identifiers" This reverts commit ef29aa7. * replace show with execute to prevent useless print output * add abfss scheme to test * remove az support for iceberg table format * remove iceberg bucket test exclusion * add note to docs on azure scheme support for iceberg table format * exclude iceberg from duckdb s3-compatibility test * disable pyiceberg info logs for tests * extend table format docs and move into own page * upgrade adlfs to enable account_host attribute * Merge branch 'devel' of https://github.com/dlt-hub/dlt into feat/1996-iceberg-filesystem * fix lint errors * re-add pyiceberg dependency * enabled iceberg in dbt-duckdb * upgrade pyiceberg version * remove pyiceberg mypy errors across python version * does not install airflow group for dev * fixes gcp oauth iceberg credentials handling * fixes ca cert bundle duckdb azure on ci * allow for airflow dep to be present during type check --------- Co-authored-by: Marcin Rudolf <[email protected]>
closes #1996